Abstract

This paper studies the generalization performance of multi-class classification algorithms, for
which we obtain, for the first time, a data-dependent generalization error bound with a logarithmic
dependence on the class size, substantially improving the state-of-the-art linear dependence
in the existing data-dependent generalization analysis. The theoretical analysis motivates us to
introduce a new multi-class classification machine based on $\ell_p$-norm regularization, where the
parameter $p$ controls the complexity of the corresponding bounds. We derive an efficient optimization
algorithm based on Fenchel duality theory. Benchmarks on several real-world datasets
show that the proposed algorithm can achieve significant accuracy gains over the state of the art.


1 Introduction 

Typical multi-class application domains such as natural language processing [1], information
retrieval 0, image annotation 0 and web advertising Q involve tens or hundreds of thousands of
classes, and yet these datasets are still growing 0. To handle such learning tasks, it is essential to
build algorithms that scale favorably with respect to the number of classes. Over the past years, much
progress in this respect has been achieved on the algorithmic side EH3, including efficient stochastic
gradient optimization strategies 0.

Although theoretical properties such as consistency |pl ITTI| and finite-sample behavior fMT
m have also been studied, there is still a discrepancy between algorithms and theory in the sense that
the corresponding theoretical bounds often do not scale well with respect to the number of classes.
This discrepancy is most pronounced in research on data-dependent generalization bounds, that
is, bounds that can measure the generalization performance of prediction models purely from the training
samples, and which are thus very appealing for model selection [16]. A crucial advantage of these
bounds is that they can better capture the properties of the distribution that has generated the data,
which can lead to tighter estimates G3 than conservative data-independent bounds.

To the best of our knowledge, for multi-class classification, the first data-dependent error bounds were
given by [14]. These bounds exhibit a quadratic dependence on the class size and were used by
[12] and [18] to derive bounds for kernel-based multi-class classification and multiple kernel learning
problems, respectively. More recently, [13] improved the quadratic dependence to a linear dependence
by introducing a novel surrogate for the multi-class margin that is independent of the true realization
of the class label.

* yunwen.lei@hotmail.com

† urundogan@gmail.com

‡ alexander.binder@tu-berlin.de

§ kloft@hu-berlin.de

However, a heavy dependence on the class size, such as linear or quadratic, implies a poor generalization
guarantee for large-scale multi-class classification problems with a massive number of classes.
In this paper, we show data-dependent generalization bounds for multi-class classification problems
that, for the first time, exhibit a sublinear dependence on the number of classes. Choosing appropriate
regularization, this dependence can be as mild as logarithmic. We achieve these improved bounds
via the use of Gaussian complexities, while previous bounds are based on a well-known structural
result on Rademacher complexities for classes induced by the maximum operator. The proposed proof
technique based on Gaussian complexities exploits potential coupling among different components of
the multi-class classifier, a fact that is ignored by previous analyses.

The result shows that the generalization ability is strongly impacted by the employed regularization,
which motivates us to propose a new learning machine performing block-norm regularization over the
multi-class components. As a natural choice we investigate here the application of the proven $\ell_p$ norm
[19]. This results in a novel $\ell_p$-norm multi-class support vector machine (SVM), which contains the
classical model by Crammer & Singer [20] as a special case for $p = 2$. The bounds indicate that the
parameter $p$ crucially controls the complexity of the resulting prediction models.

We develop an efficient optimization algorithm for the proposed method based on its Fenchel dual
representation. We empirically evaluate its effectiveness on several standard benchmarks for multi-class
classification taken from various domains, where the proposed approach significantly outperforms
the state-of-the-art method of [20] by up to 1%.

The remainder of this paper is structured as follows. Section 2 introduces the problem setting and
presents the main theoretical results. Motivated by these results, we propose a new multi-class classification
model in Section 3 and give an efficient optimization algorithm based on Fenchel duality theory. In
Section 4 we evaluate the approach for the application of visual image recognition and on several
standard benchmark datasets taken from various application domains. Section 5 concludes.


2 Theory 

2.1 Problem Setting 

This paper considers multi-class classification problems with $c \ge 2$ classes. Let $\mathcal{X}$ denote the
input space and $\mathcal{Y} = \{1, 2, \ldots, c\}$ denote the output space. Assume that we are given a sequence of
examples $S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \in (\mathcal{X} \times \mathcal{Y})^n$, independently drawn according to a probability
measure $P$ defined on the sample space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Based on the training examples $S$, we wish to
learn a prediction rule $h$ from a space $H$ of hypotheses mapping from $\mathcal{Z}$ to $\mathbb{R}$ and use the mapping
$x \to \arg\max_{y \in \mathcal{Y}} h(x, y)$ to predict. For any hypothesis $h \in H$, the margin $\rho_h(x, y)$ of the function
$h$ at a labeled example $(x, y)$ is $\rho_h(x, y) := h(x, y) - \max_{y' \neq y} h(x, y')$. The prediction rule $h$ makes
an error at $(x, y)$ if $\rho_h(x, y) \le 0$, and thus the expected risk incurred from using $h$ for prediction is
$R(h) := \mathbb{E}[1_{\rho_h(x, y) \le 0}]$.
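The margin and the error event can be illustrated on a small score matrix. The following is a minimal sketch, not part of the original analysis: the function names `margins` and `empirical_error` are hypothetical, labels are 0-based (the paper indexes classes from 1), and a zero margin is counted as an error, which is an assumption of this sketch.

```python
import numpy as np

def margins(scores, labels):
    """rho_h(x_i, y_i) = h(x_i, y_i) - max_{y' != y_i} h(x_i, y')."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(scores.shape[0])
    true_scores = scores[rows, labels]
    # Mask out the true class before taking the max over competing classes.
    competitors = scores.copy()
    competitors[rows, labels] = -np.inf
    return true_scores - competitors.max(axis=1)

def empirical_error(scores, labels):
    """Fraction of examples whose margin is non-positive."""
    return float(np.mean(margins(scores, labels) <= 0))

# Row i holds (h(x_i, 0), h(x_i, 1), h(x_i, 2)) for three examples.
scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.3, 0.2],
                   [1.0, 1.0, 0.0]])
labels = np.array([0, 2, 1])
rho = margins(scores, labels)
err = empirical_error(scores, labels)
```

The third example has a tie between the true class and a competitor, so its margin is zero and it contributes to the error under the convention used here.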

2.2 Notation 

Any function $h : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ can be equivalently represented by the function vector $(h_1, \ldots, h_c)$
with $h_j(x) = h(x, j)$, $\forall j = 1, \ldots, c$. We denote by $\tilde{H} := \{\rho_h : h \in H\}$ the class of margin functions
associated to $H$. Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel with $\phi$ being the associated feature map,
i.e., $k(x, \tilde{x}) = \langle \phi(x), \phi(\tilde{x}) \rangle$. We denote by $\|\cdot\|_*$ the dual norm of $\|\cdot\|$, i.e., $\|w\|_* := \sup_{\|\bar{w}\| \le 1} \langle w, \bar{w} \rangle$.
For a convex function $f$, we denote by $f^*$ its Fenchel conjugate, i.e., $f^*(v) := \sup_w [\langle w, v \rangle - f(w)]$.
For any $w = (w_1, \ldots, w_c)$ we define the $\ell_{2,p}$-norm by $\|w\|_{2,p} := [\sum_{j=1}^c \|w_j\|_2^p]^{1/p}$. For any $p \ge 1$, we
denote by $p^*$ the dual exponent of $p$ satisfying $1/p + 1/p^* = 1$ and $\bar{p} := p(2 - p)^{-1}$. In the remainder
of the paper, we require the following definitions.

Definition 1 (Strong Convexity). A function $f : \mathcal{X} \to \mathbb{R}$ is said to be $\beta$-strongly convex w.r.t. a norm
$\|\cdot\|$ iff $\forall x, y \in \mathcal{X}$ and $\forall \alpha \in (0, 1)$, we have

$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) - \frac{\beta}{2}\alpha(1 - \alpha)\|x - y\|^2.$$

Definition 2 (Regular Loss). We call $\ell$ an $L$-regular loss if it satisfies the following properties:

(i) $\ell(t)$ bounds the 0-1 loss from above: $\ell(t) \ge 1_{t \le 0}$;

(ii) $\ell$ is $L$-Lipschitz in the sense $|\ell(t_1) - \ell(t_2)| \le L|t_1 - t_2|$;

(iii) $\ell(t)$ is decreasing and it has a zero point $c_\ell$, i.e., $\ell(c_\ell) = 0$.

Some examples of $L$-regular loss functions include the hinge loss $\ell_h(t) = (1 - t)_+$ and the margin loss

$$\ell_\rho(t) = 1_{t \le 0} + (1 - t\rho^{-1}) 1_{0 < t \le \rho}, \quad \rho > 0. \qquad (1)$$
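The two example losses can be written down directly and their regularity properties checked numerically on a grid. This is a small sketch under stated assumptions: the function names are hypothetical, and the grid check only illustrates properties (i) to (iii) rather than proving them.

```python
import numpy as np

def hinge_loss(t):
    """Hinge loss (1 - t)_+ : 1-Lipschitz, zero point at t = 1."""
    return np.maximum(0.0, 1.0 - np.asarray(t, dtype=float))

def margin_loss(t, rho):
    """Margin loss of Eq. (1): equals 1 for t <= 0, 1 - t/rho on (0, rho], 0 beyond rho."""
    t = np.asarray(t, dtype=float)
    return np.clip(1.0 - t / rho, 0.0, 1.0)

t = np.linspace(-3.0, 3.0, 601)
zero_one = (t <= 0).astype(float)

# Largest observed slopes on the grid; they should not exceed L = 1 and L = 1/rho.
rho = 0.5
hinge_slope = np.max(np.abs(np.diff(hinge_loss(t)) / np.diff(t)))
margin_slope = np.max(np.abs(np.diff(margin_loss(t, rho)) / np.diff(t)))
```

For the margin loss the Lipschitz constant is $1/\rho$ and the zero point is $\rho$, which is why a larger margin parameter reduces the complexity term in the bounds below.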


2.3 Main results 


Our discussion on data-dependent generalization error bounds is based on the established methodology
of Rademacher and Gaussian complexities [21].

Definition 3 (Rademacher and Gaussian Complexity). Let $H$ be a family of real-valued functions
defined on $\mathcal{Z}$ and $S = (z_1, \ldots, z_n)$ a fixed sample of size $n$ with elements in $\mathcal{Z}$. Then, the empirical
Rademacher and Gaussian complexities of $H$ with respect to the sample $S$ are defined by

$$\mathfrak{R}_S(H) = \mathbb{E}_\sigma\Big[\sup_{h \in H} \frac{1}{n}\sum_{i=1}^n \sigma_i h(z_i)\Big], \qquad \mathfrak{G}_S(H) = \mathbb{E}_g\Big[\sup_{h \in H} \frac{1}{n}\sum_{i=1}^n g_i h(z_i)\Big],$$

where $\sigma_1, \ldots, \sigma_n$ are independent random variables taking the values $+1$ and $-1$ with equal probability, and
$g_1, \ldots, g_n$ are independent $N(0, 1)$ random variables.
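For a finite hypothesis class, both complexities can be approximated by Monte Carlo sampling. The sketch below is illustrative only: the function name, the number of draws, and the representation of the class as a matrix of function values are assumptions, not part of the paper.

```python
import numpy as np

def empirical_complexities(values, num_draws=20000, seed=0):
    """Monte Carlo estimates of the empirical Rademacher and Gaussian complexities.

    values[k, i] holds h_k(z_i) for the k-th hypothesis of a finite class.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))   # Rademacher signs
    g = rng.standard_normal((num_draws, n))                # N(0, 1) variables
    # For each draw take the supremum over the class, then average over draws.
    rademacher = np.mean(np.max(sigma @ values.T, axis=1)) / n
    gaussian = np.mean(np.max(g @ values.T, axis=1)) / n
    return rademacher, gaussian

# Toy class {v, -v} on n = 10 points, where v is the all-ones vector.
v = np.ones(10)
rad, gau = empirical_complexities(np.stack([v, -v]))
```

On this toy class both estimates are positive and the Rademacher estimate stays below $\sqrt{\pi/2}$ times the Gaussian estimate, in line with the known comparison between the two complexities.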


Note that we have the following comparison inequality relating Rademacher and Gaussian complexities:

$$\mathfrak{R}_S(H) \le \sqrt{\frac{\pi}{2}}\,\mathfrak{G}_S(H) \le 3\sqrt{\frac{\pi}{2}}\sqrt{\log n}\,\mathfrak{R}_S(H). \qquad (2)$$

Existing work on data-dependent generalization bounds for multi-class classifiers [12, 13, 14, 18] builds on
the following structural result on Rademacher complexities (e.g., [12], Lemma 8.1):

$$\mathfrak{R}_S(\{\max\{h_1, \ldots, h_c\} : h_j \in H_j, j = 1, \ldots, c\}) \le \sum_{j=1}^c \mathfrak{R}_S(H_j), \qquad (3)$$

where $H_1, \ldots, H_c$ are $c$ hypothesis sets. This result is crucial for the standard generalization analysis
of multi-class classification since the definition of the margin involves the maximum operator, which is
removed by the above lemma, but at the expense of a linear dependency on the number of classes.
In the following we show that this linear dependency is suboptimal because Eq. (3) does not take into
account the coupling among different classes. For example, a common regularizer used in multi-class
classification algorithms is $r(h) = \sum_{j=1}^c \|h_j\|_2^2$ [11], for which the components $h_1, \ldots, h_c$ are correlated
via a $\|\cdot\|_{2,2}$ regularizer, and the bound Eq. (3), ignoring this correlation, would not be effective in this
case [12, 14, 18].

As a remedy, we here introduce a new structural complexity result on function classes induced by
general classes via the maximum operator, which preserves the correlations among the different
components. Instead of considering the Rademacher complexity, Lemma 4 concerns
the structural relationship on Gaussian complexities since it is based on a comparison result among
different Gaussian processes.

Lemma 4 (Structural result on Gaussian complexity). Let $H$ be a class of functions defined on $\mathcal{X} \times \mathcal{Y}$
with $\mathcal{Y} = \{1, \ldots, c\}$. Let $g_1, \ldots, g_{nc}$ be independent $N(0, 1)$ distributed random variables. Then, for
any sample $S = \{x_1, \ldots, x_n\}$ of size $n$, we have

$$\mathfrak{G}_S(\{\max\{h_1, \ldots, h_c\} : h \in H\}) \le \frac{1}{n}\mathbb{E}_g \sup_{h \in H} \sum_{i=1}^n \sum_{j=1}^c g_{(j-1)n+i} h_j(x_i),$$

where $\mathbb{E}_g$ denotes the expectation w.r.t. the Gaussian variables $g_1, \ldots, g_{nc}$.
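The inequality of Lemma 4 can be sanity-checked numerically for a finite class of vector-valued hypotheses. The following Monte Carlo sketch is an illustration under assumptions (hypothetical function name, random hypotheses, fixed seed), not a proof: it estimates both sides with the index convention $g_{(j-1)n+i}$ mapped to 0-based position $jn + i$.

```python
import numpy as np

def lemma4_sides(H, num_draws=20000, seed=0):
    """Estimate both sides of Lemma 4 for a finite class.

    H has shape (m, n, c) with H[k, i, j] = h_j(x_i) for the k-th hypothesis.
    """
    rng = np.random.default_rng(seed)
    m, n, c = H.shape
    g = rng.standard_normal((num_draws, n * c))
    # Left side: Gaussian complexity of {max_j h_j}; only g_1, ..., g_n are needed.
    max_vals = H.max(axis=2)                              # shape (m, n)
    lhs = np.mean(np.max(g[:, :n] @ max_vals.T, axis=1)) / n
    # Right side: flatten class-major so that position j * n + i holds h_j(x_i).
    flat = H.transpose(0, 2, 1).reshape(m, c * n)
    rhs = np.mean(np.max(g @ flat.T, axis=1)) / n
    return lhs, rhs

H = np.random.default_rng(1).standard_normal((30, 6, 5))  # 30 hypotheses, n = 6, c = 5
lhs, rhs = lemma4_sides(H)
```

In this toy setting the right-hand side is noticeably larger than the left-hand side, as the lemma predicts.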
The proof of Lemma 4 is given in Supplementary Material A. Equipped with Lemma 4, we are
now able to present a general data-dependent margin-based generalization bound. The proofs of the
following results (Theorem 5, Theorem 7 and Corollary 8) are given in Supplementary Material B.

Theorem 5 (Data-dependent generalization bound for multi-class classification). Let $H \subset \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$
be a hypothesis class with $\mathcal{Y} = \{1, \ldots, c\}$. Let $\ell$ be an $L$-regular loss function and denote $B_\ell :=
\sup_{(x,y),h} \ell(\rho_h(x, y))$. Suppose that the examples $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ are independently drawn
from a probability measure defined on $\mathcal{X} \times \mathcal{Y}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the
following multi-class classification generalization bound holds for any $h \in H$:

$$R(h) \le \frac{1}{n}\sum_{i=1}^n \ell(\rho_h(x_i, y_i)) + \frac{2L\sqrt{2\pi}}{n}\mathbb{E}_g \sup_{h \in H} \sum_{i=1}^n \sum_{j=1}^c g_{(j-1)n+i} h_j(x_i) + 3B_\ell\sqrt{\frac{\log\frac{2}{\delta}}{2n}},$$

where $g_1, \ldots, g_{nc}$ are independent $N(0, 1)$ distributed random variables.

Remark 6. Under the same conditions as in Theorem 5, [12] derive the following data-dependent generalization
bound:

$$R(h) \le \frac{1}{n}\sum_{i=1}^n \ell(\rho_h(x_i, y_i)) + 4Lc\,\mathfrak{R}_S(\Pi_1(H)) + 3B_\ell\sqrt{\frac{\log\frac{2}{\delta}}{2n}},$$

where $\Pi_1(H) := \{x \to h(x, y) : y \in \mathcal{Y}, h \in H\}$. This linear dependence on $c$ is due to the use of
Eq. (3). For comparison, Theorem 5 implies that the dependence on the class size is governed by the
term $\sum_{i=1}^n \sum_{j=1}^c g_{(j-1)n+i} h_j(x_i)$, an advantage of which is that the components $h_1, \ldots, h_c$ are jointly
coupled. As we will see, this allows us to derive an improved result having a favorable dependence on
$c$, when a constraint is imposed on $(h_1, \ldots, h_c)$. □

The following Theorem 7 applies the general result in Theorem 5 to kernel-based methods. The
hypothesis space is defined by imposing a constraint with a general strongly convex function.

Theorem 7 (Data-dependent generalization bound for kernel-based multi-class learning algorithms).
Suppose that the hypothesis space is defined by

$$H := H_{f,\Lambda} = \{h^w = (\langle w_1, \phi(x) \rangle, \ldots, \langle w_c, \phi(x) \rangle) : f(w) \le \Lambda\},$$

where $f$ is a $\beta$-strongly convex function w.r.t. a norm $\|\cdot\|$ defined on $H$ satisfying $f^*(0) = 0$. Let
$\ell$ be an $L$-regular loss function and denote $B_\ell := \sup_{(x,y),h} \ell(\rho_h(x, y))$. Let $g_1, \ldots, g_{nc}$ be independent
$N(0, 1)$ distributed random variables. Then, for any $\delta > 0$, with probability at least $1 - \delta$ we have

$$R(h^w) \le \frac{1}{n}\sum_{i=1}^n \ell(\rho_{h^w}(x_i, y_i)) + \frac{4L}{n}\sqrt{\frac{\pi\Lambda}{\beta}\,\mathbb{E}_g \sum_{i=1}^n \big\|(g_i\phi(x_i), g_{n+i}\phi(x_i), \ldots, g_{(c-1)n+i}\phi(x_i))\big\|_*^2} + 3B_\ell\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$


We now consider the following specific hypothesis spaces using a $\|\cdot\|_{2,p}$ constraint:

$$H_{p,\Lambda} := \{h^w = (\langle w_1, \phi(x) \rangle, \ldots, \langle w_c, \phi(x) \rangle) : \|w\|_{2,p} \le \Lambda\}, \quad 1 \le p \le 2. \qquad (5)$$


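Corollary 8 ($\ell_p$-norm multi-class SVM generalization bound). Let $\ell$ be an $L$-regular loss function
and denote $B_\ell := \sup_{(x,y),h} \ell(\rho_h(x, y))$. Then, with probability at least $1 - \delta$, for any $h^w \in H_{p,\Lambda}$ the
generalization error $R(h^w)$ can be upper bounded by:

$$\frac{1}{n}\sum_{i=1}^n \ell(\rho_{h^w}(x_i, y_i)) + 3B_\ell\sqrt{\frac{\log\frac{2}{\delta}}{2n}} + \frac{2L\Lambda}{n}\sqrt{\sum_{i=1}^n k(x_i, x_i)} \times \begin{cases} \sqrt{e}\,(4\log c)^{1 + \frac{1}{2\log c}}, & \text{if } p \le \frac{2\log c}{2\log c - 1}, \\ (2p^*)^{1 + \frac{1}{p^*}}\, c^{\frac{1}{p^*}}, & \text{otherwise}. \end{cases}$$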

Remark 9. The bounds in Corollary 8 enjoy a mild dependence on the number of classes. The
dependence is polynomial with exponent $1/p^*$ for $\frac{2\log c}{2\log c - 1} < p \le 2$ and becomes logarithmic if
$1 \le p \le \frac{2\log c}{2\log c - 1}$. This is substantially milder than the quadratic dependence established in [12, 14, 18]
and the linear dependence established in [13]. Our generalization bound is data-dependent and shows
clearly how the margin affects the generalization performance (when $\ell$ is the margin loss $\ell_\rho$):
a large margin $\rho$ increases the empirical error while decreasing the model's complexity, and vice
versa. □
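To get a feeling for how the class-size factor in Corollary 8 behaves, one can evaluate it for a few values of $p$ and $c$. The sketch below assumes the factor has the form $\sqrt{e}(4\log c)^{1+1/(2\log c)}$ for $p \le 2\log c/(2\log c - 1)$ and $(2p^*)^{1+1/p^*}c^{1/p^*}$ otherwise; the function name is hypothetical and the snippet only tabulates that expression.

```python
import math

def class_size_factor(p, c):
    """Class-size dependent factor of the bound, assuming the two-case form stated above."""
    log_c = math.log(c)                                  # requires c >= 2
    threshold = 2.0 * log_c / (2.0 * log_c - 1.0)
    if p <= threshold:
        # Logarithmic regime.
        return math.sqrt(math.e) * (4.0 * log_c) ** (1.0 + 1.0 / (2.0 * log_c))
    # Polynomial regime with exponent 1 / p_star, where p_star is the dual exponent of p.
    p_star = p / (p - 1.0)
    return (2.0 * p_star) ** (1.0 + 1.0 / p_star) * c ** (1.0 / p_star)

c = 100
threshold = 2.0 * math.log(c) / (2.0 * math.log(c) - 1.0)
factor_p2 = class_size_factor(2.0, c)      # Crammer & Singer case: 8 * sqrt(c)
factor_p1 = class_size_factor(1.0, c)      # logarithmic regime
```

At the threshold the dual exponent equals $2\log c$, so the two branches coincide there; for $p = 2$ the factor reduces to $8\sqrt{c}$.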
2.4 Comparison of the Achieved Bounds to the State of the Art

Related work on data-independent bounds. The large body of theoretical work on multi-class
learning considers data-independent bounds. Based on the $\ell_\infty$-covering number bound of linear
operators, [15] obtain a generalization bound exhibiting a linear dependence on the class size, which is
improved by Q to a radical dependence of the form 0(n~i( log 5 n)^-). Under conditions analogous 
to Corollary 8, [23] derive a class-size independent generalization guarantee. However, their bound
is based on a delicate definition of margin, which is why it is commonly not used in the mainstream
multi-class literature. [1] derive the following generalization bound


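$$\mathbb{E}\Big[\frac{1}{p}\log\Big(1 + \sum_{\tilde{y} \ne y} e^{p(\rho - \langle \hat{w}_y - \hat{w}_{\tilde{y}}, \phi(x) \rangle)}\Big)\Big] \le \inf_{w \in H}\Big[\mathbb{E}\Big[\frac{1}{p}\log\Big(1 + \sum_{\tilde{y} \ne y} e^{p(\rho - \langle w_y - w_{\tilde{y}}, \phi(x) \rangle)}\Big)\Big] + \frac{\lambda n}{2(n + 1)}\|w\|_{2,2}^2\Big] + \frac{2\sup_{x \in \mathcal{X}} k(x, x)}{\lambda n}, \qquad (6)$$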


where $\rho$ is a margin condition, $p > 0$ a scaling factor, and $\lambda$ a regularization parameter. Eq. (6) is
class-size independent, yet Corollary 8 shows superiority in the following aspects: first, for SVMs (i.e.,
margin loss $\ell_\rho$), our bound consists of an empirical error ($\frac{1}{n}\sum_{i=1}^n \ell_\rho(\rho_{h^w}(x_i, y_i))$) and a complexity
term divided by the margin value (note that $L = 1/\rho$ in Corollary 8). When the margin is large (which
is often desirable) 0, the last term in the bound given by Corollary 8 becomes small, while, on the
contrary, the bound (6) is an increasing function of $\rho$, which is undesirable. Secondly, Theorem 7
applies to general loss functions, expressed through a strongly convex function over a general hypothesis
space, while the bound (6) only applies to a specific regularization algorithm. Lastly, all the above-mentioned
results are conservative data-independent estimates.

Related work on data-dependent bounds. The techniques used in the above-mentioned papers
do not straightforwardly translate to data-dependent bounds, which is the type of bound in the focus
of the present work. The investigation of these was initiated, to the best of our knowledge, by [14]: with the
structural complexity bound (3) for function classes induced via the maximum operator, [14] derive a
margin bound admitting a quadratic dependency on the number of classes. [12] use these results in
[14] to study the generalization performance of multi-class SVMs, where the components $h_1, \ldots, h_c$
are coupled with an $\|\cdot\|_{2,p}$, $p \ge 1$, constraint. Due to the usage of the suboptimal Eq. (3), [12] obtain
a margin bound growing quadratically w.r.t. the number of classes. [18] develop a new multi-class
classification algorithm based on a natural notion called the multi-class margin of a kernel. [18] also
present a novel multi-class Rademacher complexity margin bound based on Eq. (3), and the bound
also depends quadratically on the class size. More recently, [13] give a refined Rademacher complexity
bound for multi-class classification with a linear dependence on the class size. The key reason for
this improvement is the introduction of $\rho_{\theta,h}(x, y) := \min_{y' \in \mathcal{Y}}[h(x, y) - h(x, y') + \theta 1_{y' = y}]$, bounding the margin
$\rho_h$ from below, and since the maximum operation in $\rho_{\theta,h}$ is applied to the set $\mathcal{Y}$ rather than the
subset $\mathcal{Y} - \{y_i\}$ for $\rho_h$, one need not consider the random realization of $y_i$. We also use this trick
in our proof of Theorem 5. However, [13] did not improve this linear dependence to a logarithmic
dependence, as we achieve in Corollary 8, due to the use of the suboptimal structural result (3).


3 Algorithms 

Motivated by the generalization analysis given in Section 2, we now present a new multi-class
learning algorithm, based on performing empirical risk minimization in the hypothesis space (5). This
corresponds to the following $\ell_p$-norm multi-class SVM ($p \ge 1$):

Problem 10 (Primal problem: $\ell_p$-norm multi-class SVM).

$$\min_{w} \; \frac{1}{2}\Big[\sum_{j=1}^c \|w_j\|_2^p\Big]^{\frac{2}{p}} + C\sum_{i=1}^n \ell(t_i)$$
$$\text{s.t.} \quad t_i = \langle w_{y_i}, \phi(x_i) \rangle - \max_{y \ne y_i} \langle w_y, \phi(x_i) \rangle. \qquad \text{(P)}$$
For $p = 2$ we recover the seminal multi-class algorithm by Crammer & Singer [20], which is thus
a special case of the proposed formulation. An advantage of the proposed approach over [20] can
be that, as shown in Corollary 8, the dependence of the generalization performance on the class size
becomes milder as $p$ decreases to 1.

3.1 Dual problems 

Since the optimization problem (P) is convex, we can derive the associated dual problem for the
construction of efficient optimization algorithms. The derivation of the following dual problem is
deferred to Supplementary Material C. For a matrix $\alpha \in \mathbb{R}^{n \times c}$, we denote by $\alpha_i$ the $i$-th row. Denote
by $e_j$ the $j$-th unit vector in $\mathbb{R}^c$ and by $1$ the vector in $\mathbb{R}^c$ with all components being one.

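Problem 11 (Completely dualized problem for general loss functions). The Lagrangian dual problem
of (P) is:

$$\sup_{\alpha \in \mathbb{R}^{n \times c}} \; -\frac{1}{2}\Big[\sum_{j=1}^c \Big\|\sum_{i=1}^n \alpha_{ij}\phi(x_i)\Big\|_2^{\frac{p}{p-1}}\Big]^{\frac{2(p-1)}{p}} - C\sum_{i=1}^n \ell^*\Big(-\frac{\alpha_{iy_i}}{C}\Big)$$
$$\text{s.t.} \quad \alpha_{ij} \le 0 \;\wedge\; \alpha_i \cdot 1 = 0, \quad \forall j \ne y_i, \; i = 1, \ldots, n. \qquad \text{(D)}$$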


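Theorem 12 (Representer theorem). For any dual variable $\alpha \in \mathbb{R}^{n \times c}$, the associated primal
variable $w = (w_1, \ldots, w_c)$ minimizing the Lagrangian saddle problem can be represented by:

$$w_j = \Big[\sum_{\tilde{j}=1}^c \Big\|\sum_{i=1}^n \alpha_{i\tilde{j}}\phi(x_i)\Big\|_2^{p^*}\Big]^{\frac{2}{p^*} - 1} \Big\|\sum_{i=1}^n \alpha_{ij}\phi(x_i)\Big\|_2^{p^* - 2} \Big[\sum_{i=1}^n \alpha_{ij}\phi(x_i)\Big].$$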


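For the hinge loss $\ell_h(t) = (1 - t)_+$, we know its Fenchel-Legendre conjugate is $\ell_h^*(t) = t$ if $-1 \le t \le 0$
and $\infty$ elsewise. Hence $\ell_h^*(-\frac{\alpha_{iy_i}}{C}) = -\frac{\alpha_{iy_i}}{C}$ if $-1 \le -\frac{\alpha_{iy_i}}{C} \le 0$ and $\infty$ elsewise. Now we have the
following dual problem for the hinge loss function:

Problem 13 (Completely dualized problem for the hinge loss ($\ell_p$-norm multi-class SVM)).

$$\sup_{\alpha \in \mathbb{R}^{n \times c}} \; -\frac{1}{2}\Big[\sum_{j=1}^c \Big\|\sum_{i=1}^n \alpha_{ij}\phi(x_i)\Big\|_2^{\frac{p}{p-1}}\Big]^{\frac{2(p-1)}{p}} + \sum_{i=1}^n \alpha_{iy_i}$$
$$\text{s.t.} \quad \alpha_i \le e_{y_i} \cdot C \;\wedge\; \alpha_i \cdot 1 = 0, \quad \forall i = 1, \ldots, n. \qquad (7)$$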


3.2 Optimization Algorithms 

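The dual problems (D) and (7) are not quadratic programs for $p \ne 2$, and thus generally not easy
to solve. To circumvent this difficulty, we rewrite Problem 10 as the following equivalent problem:

$$\min_{w, \beta} \; \sum_{j=1}^c \frac{\|w_j\|_2^2}{2\beta_j} + C\sum_{i=1}^n \ell(t_i)$$
$$\text{s.t.} \quad t_i \le \langle w_{y_i}, \phi(x_i) \rangle - \langle w_y, \phi(x_i) \rangle, \quad y \ne y_i, \; i = 1, \ldots, n,$$
$$\|\beta\|_{\bar{p}} \le 1, \quad \bar{p} = p(2 - p)^{-1}, \quad \beta_j \ge 0. \qquad (8)$$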

The class weights $\beta_1, \ldots, \beta_c$ in Eq. (8) play a similar role as the kernel weights in $\ell_p$-norm multiple
kernel learning (MKL) algorithms [19]. The equivalence between problem (P) and Eq. (8) follows
directly from Lemma 26 in [24], which shows that the optimal $\beta = (\beta_1, \ldots, \beta_c)$ in Eq. (8) can be
explicitly represented in closed form. Motivated by the recent work on $\ell_p$-norm MKL, we propose to
solve the problem (8) via alternately optimizing $w$ and $\beta$. As we will show, given temporarily fixed
$\beta$, the optimization of $w$ reduces to a standard multi-class classification problem. Furthermore, the
update of $\beta$, given fixed $w$, can be achieved via an analytic formula.
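Problem 14 (Partially dualized problem for a general loss). For fixed $\beta$, the partial dual problem for
the sub-optimization problem (8) w.r.t. $w$ is

$$\sup_{\alpha \in \mathbb{R}^{n \times c}} \; -\frac{1}{2}\sum_{j=1}^c \beta_j \Big\|\sum_{i=1}^n \alpha_{ij}\phi(x_i)\Big\|_2^2 - C\sum_{i=1}^n \ell^*\Big(-\frac{\alpha_{iy_i}}{C}\Big)$$
$$\text{s.t.} \quad \alpha_{ij} \le 0 \;\wedge\; \alpha_i \cdot 1 = 0, \quad \forall j \ne y_i, \; i = 1, \ldots, n. \qquad (9)$$

The primal variable $w$ minimizing the associated Lagrangian saddle problem is

$$w_j = \beta_j \sum_{i=1}^n \alpha_{ij}\phi(x_i). \qquad (10)$$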
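We defer the proof to Supplementary Material C. Analogous to Problem 13, we have the following
partial dual problem for the hinge loss.

Problem 15 (Partially dualized problem for the hinge loss ($\ell_p$-norm multi-class SVM)).

$$\sup_{\alpha \in \mathbb{R}^{n \times c}} \; f(\alpha) := -\frac{1}{2}\sum_{j=1}^c \beta_j \Big\|\sum_{i=1}^n \alpha_{ij}\phi(x_i)\Big\|_2^2 + \sum_{i=1}^n \alpha_{iy_i}$$
$$\text{s.t.} \quad \alpha_i \le e_{y_i} \cdot C \;\wedge\; \alpha_i \cdot 1 = 0, \quad \forall i = 1, \ldots, n. \qquad (11)$$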


Problems 14 and 15 are quadratic, so we can use the dual coordinate ascent algorithm [25] to
solve them very efficiently for the case of linear kernels. To this end, we need to compute the gradient
and solve the restricted problem of optimizing only one $\alpha_i$, $\forall i$, keeping all other dual variables fixed [25].
The gradient of $f$ can be exactly represented by $w$:

$$\frac{\partial f}{\partial \alpha_{ij}} = -\beta_j \sum_{l=1}^n \alpha_{lj} k(x_l, x_i) + 1_{y_i = j} = 1_{y_i = j} - \langle w_j, \phi(x_i) \rangle. \qquad (12)$$

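Suppose the additive change to be applied to the current $\alpha_i$ is $\delta\alpha_i$, then

$$f(\alpha_1, \ldots, \alpha_{i-1}, \alpha_i + \delta\alpha_i, \alpha_{i+1}, \ldots, \alpha_n)$$
$$= -\sum_{j=1}^c \beta_j \sum_{l=1}^n \alpha_{lj}(\alpha_{ij} + \delta\alpha_{ij}) k(x_l, x_i) - \frac{1}{2}\sum_{j=1}^c \beta_j [\delta\alpha_{ij}]^2 k(x_i, x_i) + \delta\alpha_{iy_i} + \text{const}$$
$$= \sum_{j=1}^c \frac{\partial f}{\partial \alpha_{ij}}\delta\alpha_{ij} - \frac{1}{2}\sum_{j=1}^c \beta_j k(x_i, x_i)[\delta\alpha_{ij}]^2 + \text{const}.$$

Therefore, the sub-problem of optimizing $\delta\alpha_i$ is given by

$$\max_{\delta\alpha_i} \; -\frac{1}{2}\sum_{j=1}^c \beta_j k(x_i, x_i)[\delta\alpha_{ij}]^2 + \sum_{j=1}^c \frac{\partial f}{\partial \alpha_{ij}}\delta\alpha_{ij} \qquad (13)$$
$$\text{s.t.} \quad \delta\alpha_i \le e_{y_i} \cdot C - \alpha_i \;\wedge\; \delta\alpha_i \cdot 1 = 0.$$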

We now consider the subproblem of updating the class weights $\beta$ with temporarily fixed $w$, for which we
have the following analytic solution. The proof is deferred to Supplementary Material C.

Proposition 16 (Solving the subproblem with respect to the class weights). Given fixed $w_j$, the minimal
$\beta_j$ optimizing the problem (8) is attained at

$$\beta_j = \|w_j\|_2^{2-p}\Big(\sum_{\tilde{j}=1}^c \|w_{\tilde{j}}\|_2^p\Big)^{\frac{p-2}{p}}. \qquad (14)$$
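The closed-form class-weight update is a one-liner, and two of its properties are easy to verify numerically: the updated weights have unit $\ell_{\bar{p}}$-norm, and plugging them into $\sum_j \|w_j\|_2^2/\beta_j$ recovers $\|w\|_{2,p}^2$. The sketch below assumes the form $\beta_j = \|w_j\|_2^{2-p}(\sum_{\tilde{j}}\|w_{\tilde{j}}\|_2^p)^{(p-2)/p}$; the function name and the example norms are hypothetical.

```python
import numpy as np

def update_class_weights(w_norms, p):
    """Closed-form beta update from the per-class norms ||w_j||_2, for 1 <= p < 2.

    Assumes all norms are strictly positive, so that ||w_j||^2 / beta_j is well defined.
    """
    a = np.asarray(w_norms, dtype=float)
    return a ** (2.0 - p) * np.sum(a ** p) ** ((p - 2.0) / p)

p = 1.5
p_bar = p / (2.0 - p)                        # here p_bar = 3
w_norms = np.array([1.0, 2.0, 3.0])
beta = update_class_weights(w_norms, p)

beta_norm = np.sum(beta ** p_bar) ** (1.0 / p_bar)            # should equal 1
weighted_sum = np.sum(w_norms ** 2 / beta)                    # sum_j ||w_j||^2 / beta_j
block_norm_sq = np.sum(w_norms ** p) ** (2.0 / p)             # ||w||_{2,p}^2
```

The second identity is what makes the reformulated problem with class weights equivalent to the $\ell_{2,p}$-regularized primal.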
The update of $\beta_j$ based on Eq. (14) requires calculating $\|w_j\|_2^2$, which can be easily fulfilled by
recalling the representation established in Eq. (10).

The resulting training algorithm for the proposed $\ell_p$-norm multi-class SVM is given in Algorithm 1.
The algorithm alternates between solving a multi-class SVM problem for fixed class weights (Line 3)
and updating the class weights in a closed-form manner (Line 5). Recall that Problem 11 establishes
a completely dualized problem, which can be used as a sound stopping and evaluation criterion for
the optimization algorithm.


Algorithm 1: Training algorithm for $\ell_p$-norm multi-class classification.
input: examples $\{(x_i, y_i)\}_{i=1}^n$ and the kernel $k$.

1 initialize $\beta_j = w_j = 0$ for all $j = 1, \ldots, c$

2 while optimality conditions are not satisfied do

3 optimize the multi-class classification problem (9)

4 compute $\|w_j\|_2^2$ for all $j = 1, \ldots, c$, according to Eq. (10)

5 update $\beta_j$ for all $j = 1, \ldots, c$, according to Eq. (14)

6 end
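The alternating scheme of Algorithm 1 can be sketched for a linear kernel and the hinge loss. This is not the authors' implementation: instead of dual coordinate ascent, the inner step below uses a simple proximal subgradient method on the primal for fixed class weights, the class weights are initialized uniformly with unit $\ell_{\bar{p}}$-norm, and a fixed number of iterations replaces the duality-based stopping criterion. All names, defaults and the toy data are assumptions of this sketch.

```python
import numpy as np

def fit_lp_mcsvm(X, y, num_classes, p=1.5, C=1.0, outer=10, inner=200, lr=0.01):
    """Alternate between a w-step for fixed beta and the closed-form beta update."""
    n, d = X.shape
    idx = np.arange(n)
    p_bar = p / (2.0 - p)                        # requires 1 <= p < 2
    # Assumed initialization: uniform weights with ||beta||_{p_bar} = 1.
    beta = np.full(num_classes, num_classes ** (-1.0 / p_bar))
    W = np.zeros((num_classes, d))
    for _ in range(outer):
        # w-step (stand-in for Line 3 of Algorithm 1).
        for _ in range(inner):
            scores = X @ W.T
            rivals = scores.copy()
            rivals[idx, y] = -np.inf
            r = rivals.argmax(axis=1)            # strongest competing class
            active = scores[idx, y] - scores[idx, r] < 1.0
            G = np.zeros_like(W)                 # subgradient of C * sum_i hinge(t_i)
            np.subtract.at(G, y[active], C * X[active])
            np.add.at(G, r[active], C * X[active])
            # Subgradient step on the hinge part, exact proximal step on ||w_j||^2 / (2 beta_j).
            W = (W - lr * G) / (1.0 + lr / beta[:, None])
        # beta-step (Lines 4 and 5): closed-form update from the per-class norms.
        norms = np.maximum(np.linalg.norm(W, axis=1), 1e-12)
        beta = norms ** (2.0 - p) * np.sum(norms ** p) ** ((p - 2.0) / p)
    return W, beta

# Toy data: three well-separated classes around directions 120 degrees apart.
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0], [-2.0, 3.5], [-2.0, -3.5]])
y = np.repeat(np.arange(3), 20)
X = centers[y] + 0.3 * rng.standard_normal((60, 2))
W, beta = fit_lp_mcsvm(X, y, num_classes=3)
accuracy = np.mean((X @ W.T).argmax(axis=1) == y)
beta_norm = np.sum(beta ** 3.0) ** (1.0 / 3.0)   # p_bar = 3 for p = 1.5
```

The proximal step keeps the iteration stable even when a class weight becomes small, which a plain gradient step on $\|w_j\|_2^2/(2\beta_j)$ would not guarantee.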


4 Empirical Analysis 

We implemented the proposed $\ell_p$-norm multi-class SVM algorithm (Algorithm 1) in C++ and
solved the involved MC-SVM problem using dual coordinate ascent [25]. We experiment on three
benchmark datasets: the Sector dataset studied in [26], the News 20 dataset collected and originally
used for text classification by [27], and the Rcv1 dataset collected by [28]. Table 1 gives a description
of these datasets.


Dataset    No. of Classes   No. of Training Examples   No. of Test Examples   No. of Attributes
Sector     105              6,412                      3,207                  55,197
News 20    20               15,935                     3,993                  62,060
Rcv1       53               15,564                     518,571                47,236

Table 1: Description of datasets used in the experiments.


Method / Dataset        Sector          News 20         Rcv1
$\ell_p$-norm MC-SVM    94.20 ± 0.34    86.19 ± 0.12    85.74 ± 0.71
Crammer & Singer        93.89 ± 0.27    85.12 ± 0.29    85.21 ± 0.32

Table 2: Test set accuracies achieved by the classical Crammer & Singer and the proposed $\ell_p$-norm
multi-class SVM on the benchmark datasets.

We compare with the classical multi-class classification algorithm proposed by Crammer & Singer
[20], which constitutes a strong baseline for these datasets [25]. We employ a 5-fold cross validation on
the training set to tune the regularization parameter $C$ by grid search over the set $\{2^{-12}, 2^{-11}, \ldots, 2^{12}\}$
and the parameter $p$ from the interval [1.2, 1.25, ..., 10]. For the parameter $p$ we first use a larger
grid of step size 0.5 and then a finer grid of step size 0.1 around the optimum. Note that the model
parameters are tuned separately for each training set and only based on the training set, not the test
set. We repeat the experiments 10 times, and report in Table 2 the average accuracy and standard
deviations attained on the test set.

We observe that the proposed $\ell_p$-norm MC-SVM consistently outperforms the method by Crammer
& Singer [20] on all considered datasets. Specifically, our method attains a 0.31% accuracy gain on
Sector, a 1.07% accuracy gain on News 20, and a 0.53% accuracy gain on Rcv1. These promising results
indicate that the proposed $\ell_p$-norm multi-class SVM could further lift the state of the art in multi-class
classification, even in real-world applications beyond the ones studied in this paper.




















5 Conclusion 


Motivated by the ever growing size of multi-class datasets in real-world applications such as image 
annotation and web advertising, which involve tens or hundreds of thousands of classes, we studied 
the influence of the class size on the generalization behavior of multi-class classifiers. We focus here on 
data-dependent generalization bounds enjoying the ability to capture the properties of the distribution 
that has generated the data. Of independent interest, for hypothesis classes that are given as a 
maximum over base classes, we developed a new structural result on Gaussian complexities that is 
able to preserve the coupling among different components, while the existing structural results ignore 
this coupling and may yield suboptimal generalization bounds. We applied the new structural result 
to study learning rates for multi-class classifiers, and derived, for the first time, a data-dependent 
bound with a logarithmic dependence on the class size, which substantially outperforms the linear 
dependence in the state-of-the-art data-dependent generalization bounds. 

Motivated by the theoretical analysis, we proposed a novel $\ell_p$-norm regularized multi-class support
vector machine, where the parameter $p$ controls the complexity of the corresponding bounds. This class
of algorithms contains the classical model by Crammer & Singer [20] as a special case for $p = 2$. We
developed an effective optimization algorithm based on the Fenchel dual representation. For several
standard benchmarks for multi-class classification taken from various domains, the proposed approach
surpassed the state-of-the-art method of Crammer & Singer [20] by up to 1%.

An exciting future direction will be to derive a data-dependent bound that is completely independent
of the class size (even overcoming the mild logarithmic dependence of our bounds). To this end,
we will study more powerful structural results than Lemma 4 for controlling complexities of function
classes induced via the maximum operator. As a good starting point, we will consider
$\ell_\infty$-covering numbers.
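Supplementary Material

A Proofs on Structural Results on Gaussian Complexity

Our discussion on complexity bounds is based on the following comparison result among different
Gaussian processes.

Lemma A.1 (Theorem 1 in [29]). Let $\{\mathfrak{X}_\theta : \theta \in \Theta\}$ and $\{\mathfrak{Y}_\theta : \theta \in \Theta\}$ be two mean-zero real-valued
Gaussian processes indexed by the same countable set $\Theta$ and suppose that

$$\mathbb{E}[(\mathfrak{Y}_\theta - \mathfrak{Y}_{\bar\theta})^2] \le \mathbb{E}[(\mathfrak{X}_\theta - \mathfrak{X}_{\bar\theta})^2], \quad \forall \theta, \bar\theta \in \Theta. \qquad (A.1)$$

Then,

$$\mathbb{E}[\sup_{\theta \in \Theta} \mathfrak{Y}_\theta] \le \mathbb{E}[\sup_{\theta \in \Theta} \mathfrak{X}_\theta].$$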

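Proof of Lemma 4. Define two Gaussian processes indexed by $H$ (for any $h \in H$, we use here the
equivalent representation $h = (h_1, \ldots, h_c)$):

$$\mathfrak{X}_h := \sum_{i=1}^n g_i \max\{h_1(x_i), h_2(x_i), \ldots, h_c(x_i)\}, \qquad \mathfrak{Y}_h := \sum_{i=1}^n \sum_{j=1}^c g_{(j-1)n+i} h_j(x_i), \quad \forall h \in H.$$

For any $h = (h_1, \ldots, h_c)$, $\bar{h} = (\bar{h}_1, \ldots, \bar{h}_c) \in H$, the independence of the $g_i$ and the equalities $\mathbb{E} g_i^2 = 1$
imply that

$$\mathbb{E}[(\mathfrak{X}_h - \mathfrak{X}_{\bar{h}})^2] = \sum_{i=1}^n \big[\max\{h_1(x_i), \ldots, h_c(x_i)\} - \max\{\bar{h}_1(x_i), \ldots, \bar{h}_c(x_i)\}\big]^2,$$

$$\mathbb{E}[(\mathfrak{Y}_h - \mathfrak{Y}_{\bar{h}})^2] = \sum_{i=1}^n \big[(h_1(x_i) - \bar{h}_1(x_i))^2 + \cdots + (h_c(x_i) - \bar{h}_c(x_i))^2\big].$$

For any $a = (a_1, \ldots, a_c)$, $b = (b_1, \ldots, b_c) \in \mathbb{R}^c$, it can be directly checked that

$$|\max\{a_1, \ldots, a_c\} - \max\{b_1, \ldots, b_c\}| \le \max\{|a_1 - b_1|, \ldots, |a_c - b_c|\} \le \sum_{i=1}^c |a_i - b_i|.$$

Applying the above inequality with $a = (h_1(x_i), \ldots, h_c(x_i))$, $b = (\bar{h}_1(x_i), \ldots, \bar{h}_c(x_i))$, $i = 1, \ldots, n$,
yields directly the following bounds relating the increments of the two Gaussian processes $\mathfrak{X}_h, \mathfrak{Y}_h$:

$$\mathbb{E}[(\mathfrak{X}_h - \mathfrak{X}_{\bar{h}})^2] \le \sum_{i=1}^n \max\{|h_1(x_i) - \bar{h}_1(x_i)|, \ldots, |h_c(x_i) - \bar{h}_c(x_i)|\}^2$$
$$= \sum_{i=1}^n \max\{|h_1(x_i) - \bar{h}_1(x_i)|^2, \ldots, |h_c(x_i) - \bar{h}_c(x_i)|^2\}$$
$$\le \sum_{i=1}^n \sum_{j=1}^c |h_j(x_i) - \bar{h}_j(x_i)|^2 = \mathbb{E}[(\mathfrak{Y}_h - \mathfrak{Y}_{\bar{h}})^2], \quad \forall h, \bar{h} \in H.$$

That is, the condition (A.1) holds (with the roles of the two processes interchanged) and therefore
Lemma A.1 can be applied here to yield the stated result. □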

The following lemma gives a general Gaussian complexity bound for hypothesis spaces used in 
multi-class classification. 
Lemma A.2 (Gaussian complexity of multi-class hypothesis spaces). Let $H$ be a class of functions
defined on $\mathcal{X} \times \mathcal{Y}$ with $\mathcal{Y} = \{1, \ldots, c\}$. Let $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be a sequence of examples.
Let $g_1, \ldots, g_{nc}$ be independent $N(0, 1)$ distributed random variables. Then the empirical Gaussian
complexity of $H$ can be controlled by:

$$\mathfrak{G}_S(H) \le \frac{1}{n}\mathbb{E}_g \sup_{h \in H} \sum_{i=1}^n \sum_{j=1}^c g_{(j-1)n+i} h_j(x_i).$$

Proof. Define two Gaussian processes indexed by $H$:

$$\mathfrak{X}_h := \sum_{i=1}^n g_i h_{y_i}(x_i), \qquad \mathfrak{Y}_h := \sum_{i=1}^n \sum_{j=1}^c g_{(j-1)n+i} h_j(x_i), \quad \forall h \in H.$$

For any $h, \bar{h} \in H$, it is obvious that

$$\mathbb{E}[(\mathfrak{X}_h - \mathfrak{X}_{\bar{h}})^2] = \sum_{i=1}^n (h_{y_i}(x_i) - \bar{h}_{y_i}(x_i))^2$$
$$\le \sum_{i=1}^n \big[(h_1(x_i) - \bar{h}_1(x_i))^2 + \cdots + (h_c(x_i) - \bar{h}_c(x_i))^2\big]$$
$$= \mathbb{E}[(\mathfrak{Y}_h - \mathfrak{Y}_{\bar{h}})^2].$$

Now the stated inequality follows directly from Lemma A.1. □


B Proofs on Generalization bounds For Multi-class Classifiers 

B.l Proof of Theorem [5] 

Proof of Theorem [5] For any 9 > 0, introduce the following function bounding ph{x , y) from below: 

Pe.h(x,y) = K x ,y) ~ ma x[h{x,y ) - 91 > } = min [h(x, y) - h{x,y ) + 91 > ]. 

v ey y ey 

It can be checked that pe,hix : y) = min {ph(x,y),9). Introduce two function classes derived from pgy. 

He = { Pe,h( x ) V) ■ h G H}, Tie = {£{pe,h{x, y)) : h € H}. 

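According to the definition of $L$-regular loss function and the relationship $\rho_{\theta,h} \le \rho_h$, we have

$$R(h) = \mathbb{E}[1_{\rho_h(X,Y) \le 0}] \le \mathbb{E}[1_{\rho_{\theta,h}(X,Y) \le 0}] \le \mathbb{E}[\ell(\rho_{\theta,h}(X,Y))],$$

which, together with McDiarmid's inequality [30], yields the following inequality

$$R(h) \le \frac{1}{n} \sum_{i=1}^{n} \ell(\rho_{\theta,h}(x_i, y_i)) + 2\mathfrak{R}_S(\mathcal{H}_\theta) + 3B_\ell \sqrt{\frac{\log\frac{2}{\delta}}{2n}}, \quad \forall h \in H, \tag{B.1}$$

with probability at least $1 - \delta$.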

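For the fixed parameter $\theta = c_\ell$, we observe that $\rho_{\theta,h}(x, y) = \min(\rho_h(x, y), c_\ell)$. If $\rho_h(x, y) \ge c_\ell$, the definition of $L$-regular loss implies that

$$\ell(\rho_{\theta,h}(x, y)) = \ell(c_\ell) = 0 = \ell(\rho_h(x, y)).$$

Otherwise, we have $\rho_{\theta,h}(x, y) = \rho_h(x, y)$. Therefore, for any $(x, y)$ we have $\ell(\rho_{\theta,h}(x, y)) = \ell(\rho_h(x, y))$, which, coupled with the Lipschitz property of $\ell$ and Eq. (B.1), yields the following inequality with probability at least $1 - \delta$:

$$R(h) \le \frac{1}{n} \sum_{i=1}^{n} \ell(\rho_h(x_i, y_i)) + 2L\,\mathfrak{R}_S(H_\theta) + 3B_\ell \sqrt{\frac{\log\frac{2}{\delta}}{2n}}, \quad \forall h \in H. \tag{B.2}$$

The Rademacher complexity of $H_\theta$ satisfies the following inequality:

$$\mathfrak{R}_S(H_\theta) = \frac{1}{n}\, \mathbb{E}_\sigma \Big[\sup_{h \in H} \sum_{i=1}^{n} \sigma_i \big(h(x_i, y_i) - \max_{y \in \mathcal{Y}} (h(x_i, y) - \theta 1_{y = y_i})\big)\Big]$$

$$\le \frac{1}{n}\, \mathbb{E}_\sigma \Big[\sup_{h \in H} \sum_{i=1}^{n} \sigma_i h(x_i, y_i)\Big] + \frac{1}{n}\, \mathbb{E}_\sigma \Big[\sup_{h \in H} \sum_{i=1}^{n} \sigma_i \max_{y \in \mathcal{Y}} (h(x_i, y) - \theta 1_{y = y_i})\Big]$$

$$\le \sqrt{\frac{\pi}{2}}\, \mathfrak{G}_S(H) + \frac{1}{n} \sqrt{\frac{\pi}{2}}\, \mathbb{E}_g \Big[\sup_{h \in H} \sum_{i=1}^{n} g_i \max\{h_1(x_i) - \theta 1_{y_i = 1}, \ldots, h_c(x_i) - \theta 1_{y_i = c}\}\Big], \tag{B.3}$$

where the last step follows from the relationship between Gaussian and Rademacher processes. Furthermore, according to the comparison lemma for maxima of Gaussian processes, the last term of the above inequality can be addressed by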

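$$\mathbb{E}_g \Big[\sup_{h \in H} \sum_{i=1}^{n} g_i \max\{h_1(x_i) - \theta 1_{y_i = 1}, h_2(x_i) - \theta 1_{y_i = 2}, \ldots, h_c(x_i) - \theta 1_{y_i = c}\}\Big]$$

$$\overset{\text{Lem.}}{\le} \mathbb{E}_g \sup_{h \in H} \sum_{i=1}^{n} \big[g_i (h_1(x_i) - \theta 1_{y_i = 1}) + g_{n+i} (h_2(x_i) - \theta 1_{y_i = 2}) + \cdots + g_{(c-1)n+i} (h_c(x_i) - \theta 1_{y_i = c})\big]$$

$$= \mathbb{E}_g \sup_{h \in H} \sum_{i=1}^{n} \big[g_i h_1(x_i) + g_{n+i} h_2(x_i) + \cdots + g_{(c-1)n+i} h_c(x_i)\big] - \theta\, \mathbb{E}_g \sum_{i=1}^{n} \big[g_i 1_{y_i = 1} + \cdots + g_{(c-1)n+i} 1_{y_i = c}\big]$$

$$= \mathbb{E}_g \sup_{h \in H} \sum_{i=1}^{n} \big[g_i h_1(x_i) + g_{n+i} h_2(x_i) + \cdots + g_{(c-1)n+i} h_c(x_i)\big].$$

With this inequality and using Lemma A.2 to tackle $\mathfrak{G}_S(H)$, we immediately derive the following bound on $\mathfrak{R}_S(H_\theta)$:

$$\mathfrak{R}_S(H_\theta) \le \frac{\sqrt{2\pi}}{n}\, \mathbb{E}_g \sup_{h \in H} \sum_{i=1}^{n} \sum_{j=1}^{c} g_{(j-1)n+i}\, h_j(x_i).$$

Putting this Rademacher complexity bound back into Eq. (B.2), we obtain the stated result. □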
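For readers tracking constants, the following display restates how the two terms of (B.3) combine; it adds nothing beyond the steps of the proof, and the shorthand $\mathfrak{E}$ is introduced here only for brevity.

```latex
% Shorthand (not used in the paper):
%   \mathfrak{E} := E_g sup_{h in H} sum_{i=1}^n sum_{j=1}^c g_{(j-1)n+i} h_j(x_i)
\begin{align*}
\mathfrak{R}_S(H_\theta)
  &\le \sqrt{\tfrac{\pi}{2}} \cdot \tfrac{1}{n}\,\mathfrak{E}
     + \tfrac{1}{n}\sqrt{\tfrac{\pi}{2}}\,\mathfrak{E}
   = \tfrac{2}{n}\sqrt{\tfrac{\pi}{2}}\,\mathfrak{E}
   = \tfrac{\sqrt{2\pi}}{n}\,\mathfrak{E}, \\
2L\,\mathfrak{R}_S(H_\theta)
  &\le \tfrac{2L\sqrt{2\pi}}{n}\,\mathfrak{E}.
\end{align*}
```

The first summand uses Lemma A.2 for $\mathfrak{G}_S(H)$, the second uses the chain of inequalities for the maximum term, and the last line is the complexity term that enters (B.2).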


B.2 Proof of Theorem 7

To apply Theorem 5 we need to control the term $\sup_{h \in H} \sum_{i=1}^{n} \sum_{j=1}^{c} g_{(j-1)n+i}\, h_j(x_i)$, which we tackle by the following lemma.

Lemma B.1 (Corollary 4 in [31]). If $f$ is $\beta$-strongly convex w.r.t. $\|\cdot\|$ and $f^*(0) = 0$, then, for any sequence $v_1, \ldots, v_n$ and for any $\mu$ we have

$$\sum_{i=1}^{n} \langle v_i, \mu \rangle - f(\mu) \le \sum_{i=1}^{n} \langle \nabla f^*(v_{1:i-1}), v_i \rangle + \frac{1}{2\beta} \sum_{i=1}^{n} \|v_i\|_*^2,$$

where $v_{1:i}$ denotes the sum $\sum_{j=1}^{i} v_j$.

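Proof of Theorem 7. For the hypothesis space $H$ and any $\lambda > 0$, applying Lemma B.1 with $\mu = (w_1, \ldots, w_c)$ and $v_i = \lambda (g_i \phi(x_i), g_{n+i} \phi(x_i), \ldots, g_{(c-1)n+i} \phi(x_i))$, we have

$$\lambda \sup_{h^w \in H} \sum_{i=1}^{n} \sum_{j=1}^{c} g_{(j-1)n+i}\, h^w_j(x_i) = \sup_{h^w \in H} \sum_{i=1}^{n} \sum_{j=1}^{c} g_{(j-1)n+i} \langle w_j, \lambda \phi(x_i) \rangle$$

$$= \sup_{h^w \in H} \sum_{i=1}^{n} \big\langle (w_1, \ldots, w_c),\ \lambda (g_i \phi(x_i), g_{n+i} \phi(x_i), \ldots, g_{(c-1)n+i} \phi(x_i)) \big\rangle$$

$$\le \sup_{h^w \in H} f(w_1, \ldots, w_c) + \sum_{i=1}^{n} \langle \nabla f^*(v_{1:i-1}), v_i \rangle + \frac{\lambda^2}{2\beta} \sum_{i=1}^{n} \|(g_i \phi(x_i), g_{n+i} \phi(x_i), \ldots, g_{(c-1)n+i} \phi(x_i))\|_*^2.$$

Taking expectation on both sides w.r.t. the Gaussian variables $g_1, \ldots, g_{nc}$, the term $\langle \nabla f^*(v_{1:i-1}), v_i \rangle$ vanishes, because $v_i$ has zero mean and is independent of $v_{1:i-1}$. Therefore we obtain

$$\mathbb{E}_g \sup_{h^w \in H} \sum_{i=1}^{n} \sum_{j=1}^{c} g_{(j-1)n+i}\, h^w_j(x_i) \le \frac{\Lambda}{\lambda} + \frac{\lambda}{2\beta} \sum_{i=1}^{n} \mathbb{E}_g \|(g_i \phi(x_i), g_{n+i} \phi(x_i), \ldots, g_{(c-1)n+i} \phi(x_i))\|_*^2.$$

Choosing

$$\lambda = \sqrt{\frac{2\beta\Lambda}{\sum_{i=1}^{n} \mathbb{E}_g \|(g_i \phi(x_i), g_{n+i} \phi(x_i), \ldots, g_{(c-1)n+i} \phi(x_i))\|_*^2}},$$

the above inequality translates to

$$\mathbb{E}_g \sup_{h^w \in H} \sum_{i=1}^{n} \sum_{j=1}^{c} g_{(j-1)n+i}\, h^w_j(x_i) \le \sqrt{\frac{2\Lambda}{\beta} \sum_{i=1}^{n} \mathbb{E}_g \|(g_i \phi(x_i), g_{n+i} \phi(x_i), \ldots, g_{(c-1)n+i} \phi(x_i))\|_*^2}.$$

Putting the above complexity bound into Theorem 5, we obtain the stated result.

□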
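The choice of $\lambda$ in the proof balances the two terms $\Lambda/\lambda$ and $\lambda S/(2\beta)$, where $S$ abbreviates the sum of expected squared dual norms. A minimal numerical sketch of this trade-off follows; the values of `Lambda`, `beta` and `S` are arbitrary placeholders, and the helper names are hypothetical.

```python
import math

def tradeoff(lam, Lambda, beta, S):
    """Right-hand side Lambda / lam + lam * S / (2 * beta) as a function of lam."""
    return Lambda / lam + lam * S / (2.0 * beta)

def optimal_lambda(Lambda, beta, S):
    """Minimizer sqrt(2 * beta * Lambda / S) used in the proof."""
    return math.sqrt(2.0 * beta * Lambda / S)

Lambda, beta, S = 3.0, 0.5, 7.0          # placeholder values, not from the paper
lam_star = optimal_lambda(Lambda, beta, S)
best = tradeoff(lam_star, Lambda, beta, S)
closed_form = math.sqrt(2.0 * Lambda * S / beta)

# No value on a coarse grid of lambdas should beat the closed-form minimizer.
grid_ok = all(tradeoff(0.01 * k, Lambda, beta, S) >= best - 1e-12
              for k in range(1, 2000))
print(abs(best - closed_form) < 1e-9, grid_ok)
```

At the minimizer both terms are equal, which is why the resulting bound is exactly $\sqrt{2\Lambda S/\beta}$.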


B.3 Proof of the Corollary on $\ell_p$-norm Multi-class Classification Generalization Bounds

The following simple lemma controls the $p$-th moment of a $N(0,1)$ distributed random variable. We give the proof here for completeness.

Lemma B.2. Let $g$ be $N(0,1)$ distributed. For any $p > 0$, the $p$-th moment of $g$ can be bounded by

$$[\mathbb{E}|g|^p]^{\frac{1}{p}} \le (2p)^{\frac{1}{2} + \frac{1}{p}}.$$

Proof. Let $\Gamma$ be the Gamma function, which satisfies $\Gamma(n) = (n-1)!$ for all $n \in \mathbb{N}_+$. The $p$-th moment of a $N(0,1)$ distributed random variable can be exactly expressed via the Gamma function [32]:

$$\mathbb{E}|g|^p = \frac{2^{\frac{p}{2}}}{\sqrt{\pi}}\, \Gamma\Big(\frac{p+1}{2}\Big) \le (2p)^{\frac{p}{2} + 1},$$

where in the above deduction we have used Stirling's approximation [33]:

$$n! \le \sqrt{2\pi}\, n^{n + \frac{1}{2}} e^{-n + \frac{1}{12n}}.$$

□
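The bound of Lemma B.2 can be compared with the exact Gamma-function expression for a few exponents. The sketch below is a numerical illustration only: the helper names are hypothetical, and the check is restricted to $p \ge 1$, which covers the exponents $q^* \ge 2$ needed in the remainder of this section.

```python
import math

def abs_moment(p):
    """E|g|^p for g ~ N(0, 1): 2^(p/2) * Gamma((p + 1) / 2) / sqrt(pi)."""
    return 2.0 ** (p / 2.0) * math.gamma((p + 1.0) / 2.0) / math.sqrt(math.pi)

def lemma_bound(p):
    """Right-hand side (2p)^(1/2 + 1/p) of Lemma B.2."""
    return (2.0 * p) ** (0.5 + 1.0 / p)

exponents = [1, 1.5, 2, 3, 4, 6, 10, 20]
# Pairs (t_p, bound) with t_p = (E|g|^p)^(1/p).
table = [(abs_moment(p) ** (1.0 / p), lemma_bound(p)) for p in exponents]
print(all(t_p <= bound for t_p, bound in table))
```

As a sanity check on `abs_moment`, the second and fourth moments of a standard Gaussian are 1 and 3, respectively.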

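Proof of the corollary. Let $g_1, \ldots, g_{nc}$ be independent $N(0,1)$ distributed random variables. Denote by $t_s = [\mathbb{E}|g_1|^s]^{\frac{1}{s}}$ the $s$-th root of the $s$-th absolute moment of a $N(0,1)$ distributed random variable. Let $q$ be any number satisfying $p \le q \le 2$. Introduce the function $f_q(w) := \frac{1}{2}\|w\|_{2,q}^2$. Any $h^w \in H_{q,\Lambda}$ satisfies the inequality

$$f_q(w) = \frac{1}{2}\|w\|_{2,q}^2 \le \frac{1}{2}\Lambda^2.$$

Since $f_q(w)$ is $1/q^*$-strongly convex w.r.t. the norm $\|\cdot\|_{2,q}$, and the dual norm of $\|\cdot\|_{2,q}$ is $\|\cdot\|_{2,q^*}$, the summation of the squared dual norm in Theorem 7 can be rewritten as follows:

$$\sum_{i=1}^{n} \mathbb{E}_g \|(g_i \phi(x_i), g_{n+i} \phi(x_i), \ldots, g_{(c-1)n+i} \phi(x_i))\|_{2,q^*}^2 = \sum_{i=1}^{n} \mathbb{E}_g \Big[\sum_{j=1}^{c} \|g_{(j-1)n+i}\, \phi(x_i)\|_2^{q^*}\Big]^{\frac{2}{q^*}}$$

$$= \sum_{i=1}^{n} \mathbb{E}_g \Big[\sum_{j=1}^{c} |g_{(j-1)n+i}|^{q^*}\Big]^{\frac{2}{q^*}} k(x_i, x_i)$$

$$\overset{\text{symmetry}}{=} \mathbb{E}_g \Big[\sum_{j=1}^{c} |g_j|^{q^*}\Big]^{\frac{2}{q^*}} \sum_{i=1}^{n} k(x_i, x_i)$$

$$\overset{\text{Jensen}}{\le} c^{\frac{2}{q^*}}\, t_{q^*}^2 \sum_{i=1}^{n} k(x_i, x_i).$$

Jensen's inequality applies here because $q \le 2$ implies $2/q^* \le 1$.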


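From this, Theorem 7 immediately implies the following bound, with probability at least $1 - \delta$ and for any $h^w \in H_{q,\Lambda}$:

$$R(h^w) \le \frac{1}{n} \sum_{i=1}^{n} \ell(\rho_{h^w}(x_i, y_i)) + \frac{4L\Lambda c^{\frac{1}{q^*}} t_{q^*}}{n} \sqrt{\frac{\pi q^*}{2} \sum_{i=1}^{n} k(x_i, x_i)} + 3B_\ell \sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$

From the trivial inequality $\|w\|_{2,p} \ge \|w\|_{2,q}$, we immediately conclude $H_{p,\Lambda} \subseteq H_{q,\Lambda}$. Therefore, for any $h^w \in H_{p,\Lambda}$, we have

$$R(h^w) \le \frac{1}{n} \sum_{i=1}^{n} \ell(\rho_{h^w}(x_i, y_i)) + \inf_{p \le q \le 2} \frac{4L\Lambda c^{\frac{1}{q^*}} t_{q^*}}{n} \sqrt{\frac{\pi q^*}{2} \sum_{i=1}^{n} k(x_i, x_i)} + 3B_\ell \sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$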


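It can be directly checked that the function $t \mapsto \sqrt{t}\, c^{\frac{1}{t}}$ is decreasing along the interval $(0, 2\log c)$ and increasing along the interval $(2\log c, \infty)$. Choosing $q^* = 2\log c$ whenever this value is admissible, and $q = p$ otherwise, the above generalization bound therefore satisfies the inequality

$$R(h^w) \le \frac{1}{n} \sum_{i=1}^{n} \ell(\rho_{h^w}(x_i, y_i)) + 3B_\ell \sqrt{\frac{\log\frac{2}{\delta}}{2n}} + \frac{L\Lambda}{n} \sqrt{8\pi \sum_{i=1}^{n} k(x_i, x_i)} \times \begin{cases} \sqrt{2e \log c}\; t_{2\log c}, & \text{if } p \le \frac{2\log c}{2\log c - 1}, \\ \big(\frac{p}{p-1}\big)^{\frac{1}{2}} c^{\frac{p-1}{p}}\, t_{\frac{p}{p-1}}, & \text{otherwise.} \end{cases}$$

Applying Lemma B.2 to bound the moments of Gaussian variables, the stated result follows immediately. □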
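The monotonicity claim about $t \mapsto \sqrt{t}\, c^{1/t}$, and the value $\sqrt{2e \log c}$ at its minimizer $t = 2\log c$, can be illustrated numerically. The sketch below fixes an arbitrary class count `c = 1000` and scans a grid; the name `phi_c` is hypothetical and unrelated to the feature map $\phi$.

```python
import math

def phi_c(t, c):
    """The function t -> sqrt(t) * c^(1/t) from the proof."""
    return math.sqrt(t) * c ** (1.0 / t)

c = 1000                                   # arbitrary number of classes
t_star = 2.0 * math.log(c)                 # claimed minimizer
grid = [0.5 + 0.01 * k for k in range(1, 5000)]

is_minimum = all(phi_c(t_star, c) <= phi_c(t, c) + 1e-12 for t in grid)
# At t = 2 log c we have c^(1/t) = sqrt(e), hence the value sqrt(2 e log c).
matches_closed_form = abs(phi_c(t_star, c)
                          - math.sqrt(2.0 * math.e * math.log(c))) < 1e-9
print(is_minimum, matches_closed_form)
```

This is where the logarithmic dependence on the number of classes comes from: the factor $c^{1/q^*}$ collapses to the constant $\sqrt{e}$ at the optimal exponent.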


C Proofs on the Dual Problems 

C.1 Equivalent Representation of $\ell_p$-norm Multi-class Classification

The equivalence between Problem |P]) and Eq. ([5J follows directly from the following lemma due 

to [13- 

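Lemma C.1. Let $a_i \ge 0$, $i \in \mathbb{N}_d$, and $1 \le r < \infty$. Then

$$\min_{\eta:\ \eta_i \ge 0,\ \sum_{i \in \mathbb{N}_d} \eta_i^r \le 1}\ \sum_{i \in \mathbb{N}_d} \frac{a_i}{\eta_i} = \Big(\sum_{i \in \mathbb{N}_d} a_i^{\frac{r}{r+1}}\Big)^{1 + \frac{1}{r}},$$

and the minimum is attained at

$$\eta_i = \frac{a_i^{\frac{1}{r+1}}}{\big(\sum_{k \in \mathbb{N}_d} a_k^{\frac{r}{r+1}}\big)^{\frac{1}{r}}}.$$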
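A quick numerical sketch of Lemma C.1: evaluate the closed-form minimizer and compare it against randomly drawn feasible points. The vector `a`, the exponent `r` and the helper names are placeholders chosen for illustration; strictly positive entries are used to avoid division by zero.

```python
import random

def closed_form(a, r):
    """Minimizer and minimum value claimed by Lemma C.1."""
    s = sum(x ** (r / (r + 1.0)) for x in a)
    eta = [x ** (1.0 / (r + 1.0)) / s ** (1.0 / r) for x in a]
    return eta, s ** (1.0 + 1.0 / r)

def objective(a, eta):
    """sum_i a_i / eta_i."""
    return sum(x / e for x, e in zip(a, eta))

def random_feasible(d, r, rng):
    """Random positive vector rescaled so that sum_i eta_i^r = 1."""
    raw = [rng.uniform(0.05, 1.0) for _ in range(d)]
    scale = sum(x ** r for x in raw) ** (1.0 / r)
    return [x / scale for x in raw]

rng = random.Random(2)
a = [rng.uniform(0.1, 2.0) for _ in range(5)]
r = 1.7
eta_star, value = closed_form(a, r)
never_beaten = all(objective(a, random_feasible(5, r, rng)) >= value - 1e-9
                   for _ in range(2000))
print(never_beaten)
```

With $r = p/(2-p)$ and $a_j = \|w_j\|_2^2$ the exponent $r/(r+1)$ equals $p/2$, so the minimum value becomes $\|w\|_{2,p}^2$.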


Proof of the proposition. Fixing $w$, the sub-optimization w.r.t. $\beta$ is

$$\min_{\beta} \sum_{j=1}^{c} \frac{\|w_j\|_2^2}{\beta_j} \quad \text{s.t. } \|\beta\|_{\bar p} \le 1,\ \bar p = p(2-p)^{-1},\ \beta_j \ge 0.$$

The stated result now follows directly by applying Lemma C.1 with $r = \bar p$ and $a_j = \|w_j\|_2^2$.

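□

C.2 Derivation of the Completely Dualized Problem (Problem 11)

Derivation of Problem 11. Problem (P) translates to the following equivalent problem:

$$\min_{w, t}\ \frac{1}{2} \Big[\sum_{j=1}^{c} \|w_j\|_2^p\Big]^{\frac{2}{p}} + C \sum_{i=1}^{n} \ell(t_i) \quad \text{s.t. } t_i \le \langle w_{y_i}, \phi(x_i) \rangle - \langle w_y, \phi(x_i) \rangle,\ y \ne y_i,\ i = 1, \ldots, n. \tag{C.1}$$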


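The Lagrangian of the above convex optimization problem is

$$\mathcal{L} = \frac{1}{2} \Big[\sum_{j=1}^{c} \|w_j\|_2^p\Big]^{\frac{2}{p}} + C \sum_{i=1}^{n} \ell(t_i) + \sum_{i=1}^{n} \sum_{j \ne y_i} \delta_{ij} \big(t_i + \langle w_j, \phi(x_i) \rangle - \langle w_{y_i}, \phi(x_i) \rangle\big),$$

with Lagrangian variables $0 \le \delta \in \mathbb{R}^{n \times (c-1)}$. For the last term of the Lagrangian, we have the following identity:

$$\sum_{i=1}^{n} \sum_{j \ne y_i} \delta_{ij} \langle w_j - w_{y_i}, \phi(x_i) \rangle = \sum_{i=1}^{n} \sum_{j \ne y_i} \delta_{ij} \langle w_j, \phi(x_i) \rangle - \sum_{i=1}^{n} \sum_{j \ne y_i} \delta_{ij} \langle w_{y_i}, \phi(x_i) \rangle$$

$$= \sum_{j=1}^{c} \Big\langle w_j, \sum_{i: y_i \ne j} \delta_{ij} \phi(x_i) \Big\rangle - \sum_{j=1}^{c} \sum_{i: y_i = j} \sum_{\tilde j \ne j} \delta_{i \tilde j} \langle w_j, \phi(x_i) \rangle$$

$$= \sum_{j=1}^{c} \Big\langle w_j, \sum_{i: y_i \ne j} \delta_{ij} \phi(x_i) - \sum_{i: y_i = j} \sum_{\tilde j \ne j} \delta_{i \tilde j} \phi(x_i) \Big\rangle. \tag{C.2}$$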


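With this identity, the Lagrangian translates to

$$\mathcal{L} = \frac{1}{2} \Big[\sum_{j=1}^{c} \|w_j\|_2^p\Big]^{\frac{2}{p}} + \sum_{j=1}^{c} \Big\langle w_j, \sum_{i: y_i \ne j} \delta_{ij} \phi(x_i) - \sum_{i: y_i = j} \sum_{\tilde j \ne j} \delta_{i \tilde j} \phi(x_i) \Big\rangle + C \sum_{i=1}^{n} \ell(t_i) + \sum_{i=1}^{n} \sum_{j \ne y_i} \delta_{ij} t_i. \tag{C.3}$$

According to the definition of Fenchel conjugate function, it holds that

$$\inf_{w, t} \mathcal{L} = -\sup_{w} \Big[-\frac{1}{2} \Big[\sum_{j=1}^{c} \|w_j\|_2^p\Big]^{\frac{2}{p}} - \sum_{j=1}^{c} \Big\langle w_j, \sum_{i: y_i \ne j} \delta_{ij} \phi(x_i) - \sum_{i: y_i = j} \sum_{\tilde j \ne j} \delta_{i \tilde j} \phi(x_i) \Big\rangle\Big] - C \sum_{i=1}^{n} \sup_{t_i} \Big[-\ell(t_i) - \frac{1}{C} \sum_{j \ne y_i} \delta_{ij} t_i\Big]$$

$$= -\frac{1}{2} \Big\| \Big(\sum_{i: y_i \ne j} \delta_{ij} \phi(x_i) - \sum_{i: y_i = j} \sum_{\tilde j \ne j} \delta_{i \tilde j} \phi(x_i)\Big)_{j=1}^{c} \Big\|_{2, \frac{p}{p-1}}^2 - C \sum_{i=1}^{n} \ell^*\Big(-\frac{1}{C} \sum_{j \ne y_i} \delta_{ij}\Big), \tag{C.4}$$

where in the last step of the above deduction we have used the identity $(\frac{1}{2}\|\cdot\|^2)^* = \frac{1}{2}\|\cdot\|_*^2$ and the fact that the dual norm of $\|\cdot\|_{2,p}$ is $\|\cdot\|_{2,\frac{p}{p-1}}$. Consequently, the dual problem becomes

$$\sup_{\delta \in \mathbb{R}^{n \times (c-1)}}\ -\frac{1}{2} \Big[\sum_{j=1}^{c} \Big\| \sum_{i: y_i \ne j} \delta_{ij} \phi(x_i) - \sum_{i: y_i = j} \sum_{\tilde j \ne j} \delta_{i \tilde j} \phi(x_i) \Big\|_2^{\frac{p}{p-1}}\Big]^{\frac{2(p-1)}{p}} - C \sum_{i=1}^{n} \ell^*\Big(-\frac{1}{C} \sum_{j \ne y_i} \delta_{ij}\Big)$$

$$\text{s.t. } \delta \ge 0.$$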

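Introducing $\alpha \in \mathbb{R}^{n \times c}$ via the substitution:

$$\alpha_{ij} = \begin{cases} -\delta_{ij} & \text{if } j \ne y_i, \\ \sum_{\tilde j: \tilde j \ne y_i} \delta_{i \tilde j} & \text{if } j = y_i, \end{cases} \tag{C.5}$$

we have

$$\sum_{i: y_i \ne j} \delta_{ij} \phi(x_i) - \sum_{i: y_i = j} \sum_{\tilde j \ne j} \delta_{i \tilde j} \phi(x_i) = -\sum_{i: y_i \ne j} \alpha_{ij} \phi(x_i) - \sum_{i: y_i = j} \alpha_{ij} \phi(x_i), \tag{C.6}$$

from which the stated dual problem follows directly. □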
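The change of variables in (C.5) and the identity (C.6) can be verified on random data. In the sketch below the feature map is taken to be scalar-valued purely for simplicity, `delta` is stored as a full $n \times c$ table whose entry at the true label is ignored, and all names are hypothetical.

```python
import random

def to_alpha(delta, labels, c):
    """Apply substitution (C.5); delta[i][labels[i]] is never read."""
    alpha = []
    for i, y in enumerate(labels):
        row = [-delta[i][j] for j in range(c)]                  # j != y_i
        row[y] = sum(delta[i][j] for j in range(c) if j != y)   # j == y_i
        alpha.append(row)
    return alpha

def lhs_c6(delta, labels, feats, j, c):
    """Left-hand side of (C.6) for class j with scalar features."""
    first = sum(delta[i][j] * feats[i] for i, y in enumerate(labels) if y != j)
    second = sum(delta[i][k] * feats[i]
                 for i, y in enumerate(labels) if y == j
                 for k in range(c) if k != j)
    return first - second

rng = random.Random(3)
n, c = 8, 4
labels = [rng.randrange(c) for _ in range(n)]
delta = [[rng.uniform(0, 1) for _ in range(c)] for _ in range(n)]
feats = [rng.gauss(0, 1) for _ in range(n)]
alpha = to_alpha(delta, labels, c)

rows_sum_to_zero = all(abs(sum(row)) < 1e-12 for row in alpha)
identity_holds = all(
    abs(lhs_c6(delta, labels, feats, j, c)
        + sum(alpha[i][j] * feats[i] for i in range(n))) < 1e-9
    for j in range(c))
print(rows_sum_to_zero, identity_holds)
```

By construction each row of `alpha` sums to zero and is nonpositive off the true label, which corresponds to the constraints appearing in (C.8).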


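C.3 Proof of the Representer Theorem (Theorem 12)

Let $H_1, \ldots, H_c$ be $c$ Hilbert spaces and $p > 1$. Define the function $g_p : H_1 \times \cdots \times H_c \to \mathbb{R}$ by

$$g_p(v_1, \ldots, v_c) = \frac{1}{2} \|(v_1, \ldots, v_c)\|_{2,p}^2, \quad p > 1.$$

Lemma C.2. The gradient of $g_p$ is

$$\frac{\partial g_p(v_1, \ldots, v_c)}{\partial v_j} = \Big[\sum_{\tilde j = 1}^{c} \|v_{\tilde j}\|_2^p\Big]^{\frac{2}{p} - 1} \|v_j\|_2^{p-2}\, v_j.$$

Proof. By the chain rule, we have

$$\frac{\partial g_p(v_1, \ldots, v_c)}{\partial v_j} = \frac{1}{p} \Big[\sum_{\tilde j = 1}^{c} \|v_{\tilde j}\|_2^p\Big]^{\frac{2}{p} - 1} \frac{\partial \|v_j\|_2^p}{\partial v_j} = \Big[\sum_{\tilde j = 1}^{c} \|v_{\tilde j}\|_2^p\Big]^{\frac{2}{p} - 1} \|v_j\|_2^{p-2}\, v_j.$$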

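□

Proof of Representer Theorem (Theorem 12). In our derivation of the dual problem (see Eq. (C.4)), the variable $w$ should meet the optimality in the sense that

$$w = \arg\max_{v_1, \ldots, v_c}\ -\frac{1}{2} \Big[\sum_{j=1}^{c} \|v_j\|_2^p\Big]^{\frac{2}{p}} + \sum_{j=1}^{c} \Big\langle v_j, \sum_{i=1}^{n} \alpha_{ij} \phi(x_i) \Big\rangle.$$

Since $(\nabla f)^{-1} = \nabla f^*$ for any convex function $f$, and the Fenchel-conjugate of $g_p$ is $g_{p^*}$ with $p^* = \frac{p}{p-1}$, we obtain the following representation of $w$:

$$w = \nabla g_p^{-1}\Big(\sum_{i=1}^{n} \alpha_{i1} \phi(x_i), \ldots, \sum_{i=1}^{n} \alpha_{ic} \phi(x_i)\Big) = \nabla g_{p^*}\Big(\sum_{i=1}^{n} \alpha_{i1} \phi(x_i), \ldots, \sum_{i=1}^{n} \alpha_{ic} \phi(x_i)\Big)$$

$$= \Big[\sum_{\tilde j = 1}^{c} \Big\|\sum_{i=1}^{n} \alpha_{i \tilde j} \phi(x_i)\Big\|_2^{p^*}\Big]^{\frac{2}{p^*} - 1} \Big(\Big\|\sum_{i=1}^{n} \alpha_{i1} \phi(x_i)\Big\|_2^{p^* - 2} \Big[\sum_{i=1}^{n} \alpha_{i1} \phi(x_i)\Big], \ldots, \Big\|\sum_{i=1}^{n} \alpha_{ic} \phi(x_i)\Big\|_2^{p^* - 2} \Big[\sum_{i=1}^{n} \alpha_{ic} \phi(x_i)\Big]\Big).$$

That is,

$$w_j = \Big[\sum_{\tilde j = 1}^{c} \Big\|\sum_{i=1}^{n} \alpha_{i \tilde j} \phi(x_i)\Big\|_2^{p^*}\Big]^{\frac{2}{p^*} - 1} \Big\|\sum_{i=1}^{n} \alpha_{ij} \phi(x_i)\Big\|_2^{p^* - 2} \Big[\sum_{i=1}^{n} \alpha_{ij} \phi(x_i)\Big].$$

□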
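The gradient formula of Lemma C.2, on which the representation of $w$ rests, can be sanity-checked with central finite differences. The sketch below replaces the Hilbert spaces by small Euclidean blocks; the block sizes, the exponent `p = 1.5` and the function names are arbitrary illustrative choices.

```python
import random

def g(v, p):
    """g_p(v_1, ..., v_c) = 0.5 * (sum_j ||v_j||_2^p)^(2/p); each v_j is a list."""
    return 0.5 * sum(sum(x * x for x in vj) ** (p / 2.0) for vj in v) ** (2.0 / p)

def grad_g(v, p):
    """Closed-form gradient from Lemma C.2, block by block."""
    norms = [sum(x * x for x in vj) ** 0.5 for vj in v]
    outer = sum(nj ** p for nj in norms) ** (2.0 / p - 1.0)
    return [[outer * nj ** (p - 2.0) * x for x in vj] for vj, nj in zip(v, norms)]

def numeric_grad(v, p, eps=1e-6):
    """Central finite-difference approximation of the gradient of g."""
    grad = []
    for j, vj in enumerate(v):
        row = []
        for k in range(len(vj)):
            plus = [list(block) for block in v]
            minus = [list(block) for block in v]
            plus[j][k] += eps
            minus[j][k] -= eps
            row.append((g(plus, p) - g(minus, p)) / (2.0 * eps))
        grad.append(row)
    return grad

rng = random.Random(4)
p = 1.5
v = [[rng.uniform(0.5, 2.0) for _ in range(3)] for _ in range(4)]
exact, approx = grad_g(v, p), numeric_grad(v, p)
max_err = max(abs(a - b) for ra, rb in zip(exact, approx) for a, b in zip(ra, rb))
print(max_err < 1e-6)
```

For $p = 2$ the formula reduces to the identity map, in line with $g_2(v) = \frac{1}{2}\|v\|_2^2$.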


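C.4 Derivation of Partially Dualized Problem (Problem 14)

Derivation of Problem 14. For fixed $\beta$, the Lagrangian of the problem w.r.t. $w$ is

$$\mathcal{L} = \sum_{j=1}^{c} \frac{\|w_j\|_2^2}{2\beta_j} + C \sum_{i=1}^{n} \ell(t_i) + \sum_{i=1}^{n} \sum_{j \ne y_i} \delta_{ij} \big(t_i + \langle w_j, \phi(x_i) \rangle - \langle w_{y_i}, \phi(x_i) \rangle\big),$$

with Lagrangian variables $0 \le \delta \in \mathbb{R}^{n \times (c-1)}$.

According to the identity (C.2), the Lagrangian translates to

$$\mathcal{L} = \sum_{j=1}^{c} \frac{\|w_j\|_2^2}{2\beta_j} + \sum_{j=1}^{c} \Big\langle w_j, \sum_{i: y_i \ne j} \delta_{ij} \phi(x_i) - \sum_{i: y_i = j} \sum_{\tilde j \ne j} \delta_{i \tilde j} \phi(x_i) \Big\rangle + C \sum_{i=1}^{n} \ell(t_i) + \sum_{i=1}^{n} \sum_{j \ne y_i} \delta_{ij} t_i. \tag{C.7}$$

According to the definition of Fenchel conjugate function, it holds that

$$\inf_{w, t} \mathcal{L} = -\sum_{j=1}^{c} \frac{1}{\beta_j} \sup_{w_j} \Big[-\frac{1}{2} \|w_j\|_2^2 - \Big\langle w_j, \beta_j \Big(\sum_{i: y_i \ne j} \delta_{ij} \phi(x_i) - \sum_{i: y_i = j} \sum_{\tilde j \ne j} \delta_{i \tilde j} \phi(x_i)\Big) \Big\rangle\Big] - C \sum_{i=1}^{n} \sup_{t_i} \Big[-\ell(t_i) - \frac{1}{C} \sum_{j \ne y_i} \delta_{ij} t_i\Big]$$

$$= -\frac{1}{2} \sum_{j=1}^{c} \beta_j \Big\| \sum_{i: y_i \ne j} \delta_{ij} \phi(x_i) - \sum_{i: y_i = j} \sum_{\tilde j \ne j} \delta_{i \tilde j} \phi(x_i) \Big\|_2^2 - C \sum_{i=1}^{n} \ell^*\Big(-\frac{1}{C} \sum_{j \ne y_i} \delta_{ij}\Big),$$

where in the last step of the above deduction we have used the identity $(\frac{1}{2}\|\cdot\|^2)^* = \frac{1}{2}\|\cdot\|_*^2$ and the fact that the dual norm of $\|\cdot\|_2$ is itself. Consequently, the dual problem becomes

$$\sup_{\delta \in \mathbb{R}^{n \times (c-1)}}\ -\frac{1}{2} \sum_{j=1}^{c} \beta_j \Big\| \sum_{i: y_i \ne j} \delta_{ij} \phi(x_i) - \sum_{i: y_i = j} \sum_{\tilde j \ne j} \delta_{i \tilde j} \phi(x_i) \Big\|_2^2 - C \sum_{i=1}^{n} \ell^*\Big(-\frac{1}{C} \sum_{j \ne y_i} \delta_{ij}\Big)$$

$$\text{s.t. } \delta \ge 0.$$

Introducing $\alpha \in \mathbb{R}^{n \times c}$ as in Eq. (C.5) and noticing the identity (C.6), the above dual problem becomes

$$\sup_{\alpha \in \mathbb{R}^{n \times c}}\ -\frac{1}{2} \sum_{j=1}^{c} \beta_j \Big\| \sum_{i=1}^{n} \alpha_{ij} \phi(x_i) \Big\|_2^2 - C \sum_{i=1}^{n} \ell^*\Big(-\frac{\alpha_{i y_i}}{C}\Big)$$

$$\text{s.t. } \sum_{j=1}^{c} \alpha_{ij} = 0,\ \forall i = 1, 2, \ldots, n, \qquad \alpha_{ij} \le 0,\ j \ne y_i,\ \forall i = 1, \ldots, n. \tag{C.8}$$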

Note that in the above derivation of the dual problem, the variable $w$ should meet the optimality in the sense that

$$w = \arg\max_{v_1, \ldots, v_c}\ -\frac{1}{2} \sum_{j=1}^{c} \|v_j\|_2^2 + \sum_{j=1}^{c} \beta_j \Big\langle v_j, \sum_{i=1}^{n} \alpha_{ij} \phi(x_i) \Big\rangle.$$

The representer theorem stated in Problem 14 follows directly from this optimality condition. □